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Abstract 


Over the past few years, trace regression models have received considerable 
attention in the context of matrix completion, quantum state tomography, and 
compressed sensing. Estimation of the underlying matrix from regularization- 
based approaches promoting low-rankedness, notably nuclear norm regulariza¬ 
tion, have enjoyed great popularity. In the present paper, we argue that such 
regularization may no longer be necessary if the underlying matrix is symmet¬ 
ric positive semidefinite (spd) and the design satisfies certain conditions. In 
this situation, simple least squares estimation subject to an spd constraint may 
perform as well as regularization-based approaches with a proper choice of the 
regularization parameter, which entails knowledge of the noise level and/or tun¬ 
ing. By contrast, constrained least squares estimation comes without any tuning 
parameter and may hence be preferred due to its simplicity. 


1 Introduction 


Trace regression models of the form 

yi=iY{XjT.*) + ei, i = I,...,n, (I) 

where S* € jg parameter of interest to be estimated given measurement 

matrices Xi € R™ixm 2 observations yi contaminated by errors £i, i = 1,... ,n, 
have attracted considerable interest in high-dimensional statistical inference, machine 
learning and signal processing over the past few years. Research in these areas has 
focused on a setting with few measurements n mi • m 2 and S* being (at least 
approximately) of low rank r ^ min{mi, m 2 }. Such setting is relevant, among others, 
to problems such as matrix completion liISS], compressed sensing IZHU], quantum 
state tomography [14] and phase retrieval |9] . A common thread in these works is the 
use of the nuclear norm of a matrix as a convex surrogate for its rank [22] in regularized 
estimation amenable to modern optimization techniques. This approach can be seen 
as natural generalization of £i-norm (aka lasso) regularization for the standard linear 
regression model [ 55 ] that arises as a special case of model © in which both E* 
and the measurement matrices are diagonal. It is inarguable that in general 

regularization is essential if n < mi • m 2 . However, the situation is less clear if E* 
is known to satisfy additional constraints that can be incorporated in estimation. 
Specifically, in the present paper we consider the case in which mi = m 2 = m and 
E* is known to be symmetric positive semidefinite (spd), written as E* G S™ with S™ 
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denoting the positive semidefinite cone in the space of symmetric real-valued m x m 
matrices S™. The set deserves specific interest as it includes covariance matrices 
and Gram matrices in kernel-based learning methods [24) . It is rather common for 
these matrices to be of low rank (at least approximately), given the widespread use of 
principal components analysis and low-rank kernel approximations [33] . In the present 
paper, we focus on the usefulness of the spd constraint for estimation. We argue that if 
E* is spd and the measurement matrices obey certain conditions, constrained 

least squares estimation 
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( 2 ) 

may perform similarly well in prediction and parameter estimation as approaches em¬ 
ploying nuclear norm regularization with proper choice of the regularization parameter, 
including the interesting regime n < Sm, where Sm = dim(§'") = m{m + l)/2. Note 
that the objective in (I2|) only consists of a data fitting term and is hence convenient to 
work with in practice since one does not need to choose any parameter. Our findings 
can be seen as a non-commutative extension of recent results on non-negative least 
squares estimation for high-dimensional linear regression with non-negative parame¬ 
ters [20l[25]. In these papers it is shown that for certain design matrices, non-negative 
least squares can achieve comparable performance to £i-norm regularized estimation 
with regard to prediction, estimation and support recovery, thereby generalizing prior 
work [IIIIISI] on sparse recovery of a non-negative vector in a noiseless setting. 

Related work. Model (U]) with S* G S™ has been studied in several recent papers. 
A good deal of these papers consider the setup of compressed sensing according to 
which the matrices {Xi}f^i can be chosen by the user, with the goal to minimize the 
number of observations required to (approximately) recover S*. 

In [32) . the problem of exactly recovering S* being low-rank from noiseless observations 
(ci = 0, i = 1,..., n) by solving a linear feasibility problem over the positive semidefi¬ 
nite cone is considered, which is equivalent to the proposed least squares problem o 
in a noiseless setting. Apart from the fact that we primarily study a noisy setting, we 
shall argue below that in the setup of compressed sensing the measurement matrices 
studied in [32) constitute an unfavourable choice relative to those recommended in the 
present paper. 

In [To], recovery from rank-one measurements is considered, i.e., for C K™ 

Ui = xjT,*Xi + Ei = tr(Xj^E*) -I- Ei, with Xi = XixJ, i = 1,..., n. (3) 

As opposed to m. where estimation based on nuclear norm regularization is pro¬ 
posed, the present work is devoted to regularization-free estimation. While rank-one 
measurements as in ((Sj) are also in the center of interest herein, our framework is not 
limited to this specific case. 

In [^, rank-one measurements are considered for general E* G Specializing 

to E* G S™, the authors discuss an application of (l3|) to covariance matrix estimation 
given only one-dimensional projections {xJ of the data points, where the {zi}f^i 

are i.i.d. from a distribution with zero mean and covariance matrix E*. In fact, when 
using observations yi = {xJZi)'^, one obtains 

{xJ Zi)"^ = xJ ZizJ Xi = xJ 'E*Xi + Ei, with Ei = xJ {zizj - T,*}xi, i = 1,. ..,n. (4) 

On the other hand, in (5, no specific attention is given to the spd constraint: the convex 
program proposed therein, which can be seen as a modification of the approach in [10) , 
applies to general symmetric matrices and does not enforce positive semidefiniteness. 

Specializing model (I3|) further to the case in which also E* = a*{a*)^ has rank one. 
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one obtains the quadratic model 


y^ = \xja*\^+ £, (5) 

which (with complex-valued a*) is relevant to the problem of phase retrieval [T7] that 
has received some attention recently. The approach of [5] treats as an instance 
of ® and uses nuclear norm regularization to enforce rank-one solutions. In follow¬ 
up work [^, the authors show a refined recovery result stating that imposing an spd 
constraint — without regularization — suffices. A similar result has been proven inde¬ 
pendently by [12]. However, the results in both [6] and [12] only concern model ®. 
In [I8j . E* is assumed to be a complex Hermitian positive semidefinite matrix of unit 
trace, which is the setting in quantum state tomography. While the setting as well 
as the measurement matrices under consideration are different from ours, a notable 
point of contact to our work can be seen in the fact that the negative von Neumann 
entropj0, which is the proposed regularizer in [18] . does not promote low rankedness, 
but constitutes one possible way of enforcing positive definiteness. At the same time, 
adaptivity of the approach to low rankedness is established in |18| . 

Outline and contributions of the paper. In Section|2l we study statistical proper¬ 
ties of constrained least squares estimation in small sample {n < 6m) and low-rank 
settings. Specifically, we introduce certain geometric conditions associated with the 
measurements {Xi}2—i that allow us to derive non-asymptotic upper bounds on the 
prediction and estimation error indicating that 0 can achieve competitive perfor¬ 
mance while being regularization-free. On the other hand, we show that without 
extra conditions on the measurements the performance of (O can be as poor 

as that of unconstrained least squares. Section [3] contains numerical results based on 
synthetic and real world data that support or complement our theoretical results. Our 
findings are briefly summarized in Section [d] The appendix contains the proofs. 

Notation. We here gather notation and terminology used throughout the paper. For 
an integer d > 1, let denote the Euclidean vector space of real d x d matrices 
with inner product (M, M') := tr(M^M'), M,M' € M"^. The set of real symmetric 
dx d matrices is a subspace of of dimension 6d := d{d + l)/2. Each element M 
of has an eigen-decomposition M = UAU^ = ^ where Ai(M) = 

Amax(AI) > \ 2 {M) > ... > Xd{M) = Amin(Af) is the sequence of real eigenvalues 
with corresponding orthonormal eigenvectors A = diag(Ai(M),..., Ad(M)), 

and U = [ui ... Ud\. Eor q S [l,oo], can be endowed with a norm given by 

the mapping M ||M||g := called the Schatten-g-norm. In 

particular, for q = 1 we speak of the nuclear norm, while q = 2 yields the Frobenius 
norm IHIf. We set ||M||oo := maxi<j<d |Aj(M)|, the spectral norm of M. We denote 
by Si{d,) = {M G : \\M\\q = 1} the Schatten-1-norm unit sphere and set {d) = 
Si{d) n S[|., where = {M € : v^Mv > 0 Vv € K.'^} is the positive semidefinite 
cone in The symbols are understood with respect to the semidefinite 

ordering, e.g. M ^ M' means that M' — Mg For v G and q G [l,oo], ||r!||q 
denotes the usual q-norm. For set A, i? and a real number a, aA := {aa,a G A}, 
A — B = {a — b,a G A,b G B}, and for A,B C dist(A, B) := minag^_f,eB||a — &|| 2 - 

It is convenient to re-write model o as 

y = X{T.*)+£, 

where y = {yi)2=i, £ = {£i)2=i ^ is a linear map defined by {X{M))i = 

tr(Al'^M), i = 1,..., n, referred to as sampling operator. Its adjoint X* : K." —>■ M'” 
is given by the map v i—>■ 

^The von Neumann entropy of a positive definite Hermitian matrix is given by the entropy of its 
eigenvalues 
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2 Analysis 


Preliminaries. Throughout this section, we consider a special instance of model o 
in which 

= tr{X,T.*) + Ei, where G S™, and ' iV(0, a^), i = 1,..., n. 

( 6 ) 

The assumption that the errors follow a Gaussian distribution is made for 

convenience as it simplifies the stochastic part of our analysis, which could be extended 
to cover error distributions with sub-Gaussian tails. 

Note that without loss of generality, we may assume that the are symmetric. 

In fact, any M G M'" can be decomposed as 

M = where and ~ ^ 

2 2 

denote the Euclidean projections of M onto S'" and its orthogonal complement (the 
subspace of skew-symmetric matrices), respectively. Accordingly, since E* G S'", we 
have tr(ME*) = tr(M®y“E*). 

In the sequel, we study the statistical performance of the constrained least squares 
estimator 

E G argmin ^lly-A’(E)||| (7) 

egS™ 2n 

under model ([51) with respect to prediction and estimation. More specifically, under 
certain conditions on X, we shall derive bounds on 

(a) -||A(E*)-A(E)||2, and (b) ||E-E*||i, (8) 

n 

where (a) will be referred to as “prediction error” below. 

The most basic method for estimating E* is ordinary least squares (ols) estimation 

E°^® G argmin -^\\y - A’(E)||2, (9) 

EgSm 2n 

which is computationally much simpler than 0. While obtaining © requires tech¬ 
niques from convex programming, it is straightforward to compute ([9]) by solving a 
linear system of equations in 6m = m(rn + l)/2 variables. On the other hand, the 
prediction error of ols scales as Op(dim(range(A’))/n), where dim(range(A’)) can be 
as large as minjn, dm}, in which case the prediction error vanishes asymptotically only 
if Sm/n —>■ 0 as n —oo. Moreover, the estimation error ||E°'® — E*||i is unbounded un¬ 
less n> Sm- Research conducted over the past few years has consequently focused on 
methods that deal successfully with the situation n < <5^ if the target E* possesses ad¬ 
ditional structure, notably low-rankedness. Indeed, if E* has rank r <C to, the intrinsic 
dimension of the problem becomes (roughly) mr <g; Sm- Rank-constrained estimation 
or regularized estimation with the matrix rank as regularizer yield computationally 
intractable optimization problems in general. In a large body of work, nuclear norm 
regularization, which can be seen as a convex surrogate of rank regularization, is con¬ 
sidered as a computationally convenient alternative for which a series of adaptivity 
properties to underlying low-rankedness has been established, e.g. [a na Eu EH Eg. 
Complementing ([9]) with nuclear norm regularization gives rise to the estimator 

E^ G argmin ^\\y - A’(E)||^ -b A||E||i, (10) 

EgS”* 2n 

where A > 0 is a regularization parameter. In case an spd constraint is imposed (na 
becomes 

G argmin -^Wy - A’(E)|l 2 -b Atr(E). (11) 
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Our analysis aims at elucidating potential advantages of the spd constraint in the 
constrained least squares problem ([7]) from a statistical point of view. It turns out 
that depending on properties of X, the behaviour of E can range from a performance 
similar to the least squares estimator on the one hand to a performance similar 
to the nuclear norm regularized estimator E^"*" with properly chosen/tuned A on the 
other hand. The latter case appears to be remarkable inasmuch as E may enjoy similar 
adaptivity properties as nuclear norm regularized estimators even though E is obtained 
from a pure data fitting problem without any explicit form of regularization. 


2.1 Negative results 

We first discuss examples of X for which the spd-constrained estimator E does not 
improve (substantially) over the unconstrained estimator E°*®. At the same time, these 
examples provide some clues on conditions that need to be imposed on X to achieve 
substantially better performance. 

Example 1: equivalence of constrained and unconstrained least squares 

Let m be even and consider measurement matrices of the form 

A = ~' 

' [ 0 -A, 

for matrices A^ € i= 1,..., n. For E G arbitrary, we can partition 



where En is the top m/2 x m/2 block of E etc. We have 

tr(AiE) = tr(Ai(Eii - E 22 )), i = 1,..., n. 

Hence E enters the least squares objective ([2]) via the difference of the top and bottom 
m/2 X m/2 blocks. Since for any dimension d 

{E - E' : Eg §+, E' G S([} = = {E - E' : E G E' G 

the spd constraint becomes vacuous and can be dropped from ©• 

Example 2: Orthonormal design 

The following statement indicates that for orthonormal design, the prediction error 
of E cannot be expected to improve over that of E°^® by substantially more than a 
constant factor 1/2. 

Proposition 1. Let E* = 0 so that y = e, let n = dm and let {Ai}i<j<, 5 ^ be an 
orthonormal basis ofS"^. Then, ||A’(E)|||/n ^ in probability as m,n ^ 00 . 

By contrast, it is desired that ||A’(E)|||/n = op(l) as m,n —>■ 00 . 

Example 3: Random Ganssian design 

Consider the Gaussian orthogonal ensemble (GOE) of random matrices 
GOE(m) = {A = {xjk)i<j,k<m, {x,,}™ 1 ■ A'(0,1), 

{Xjk = Xkj}l<j<k<m ■ A^(0,1/2)}. 

Random Gaussian measurements are common in compressed sensing-type settings, see 
It is hence of interest to study measurements Xi ‘ ^ ' GOE(m), i = 1,..., n, 
in the context of the constrained least squares problem ©■ The following statement, 
which follows from results in [5] , points to a serious limitation associated with the use 
of such measurements. 
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Proposition 2. Consider measurements Xi GOE(m), i = Then, for 

any e > 0, z/n < (1 — with probability at least 1 — 32 exp(— there exists 

A e S!p, A ^ 0 such that X{A) = 0. 

Proposition [5] has the following implications. 

• If the number of measurements drops below one half of the ambient dimension 
6rm estimating E* based on 0 becomes ill-posed; the estimation error ||E—E*||i 
is unbounded, irrespective of the rank of E*. 

• Geometrically, the consequence of Proposition [5] is that the convex cone Cx = 
{z € M" : z = A’(A), A € S™} contains 0. Unless 0 is contained in the boundary 
otCx (we conjecture that this event has measure zero), this means that Cx = K.", 

i.e., the spd constraint becomes vacuous. 

Remarks. 

1. In [22], the following noiseless analog to the constrained least squares problem 
0 is considered: 

find E € S™ such that A’(E) =y = A’(E*), (12) 

where Xi ~ GOE(m), i = 1,... ,n. The authors prove that for all f G (0,1), 
there exists a G (0,1) so that if n > aSm, E* is the unique solution of the 
feasibility problem (TT^ as long as rank(E*) < ^m. While this implies that the 
spd constraint allows undersampling (i.e., n < (5m), it is not clear to what extent 
undersampling is possible, i.e., how small a could possibly be. Proposition |2] 
yields that a cannot be smaller than 1/2. 

2. It is of interest to relate Proposition|2]to corresponding results on the vector case 
(equivalent to having diagonal {Xi}^^.^ and diagonal E*) in [T2]. Gompared to 
Proposition 121 the corresponding result in m applies to a much wider class of 
random measurement matrices including all random matrices with i.i.d. entries 
from a symmetric distribution around zero. It is thus natural to ask whether 
Proposition 12] holds more generally for all Wigner matrices [27] . 

3. The fact that the threshold ^Sm for the number measurements in Proposition [2] 
equals (up to the scaling factor a'^) the asymptotic prediction error of Example 
2 is not a coincidence; this is part of a wider phenomenon as pointed out in [2]. 
In the framework of [2], ^Sm is the “statistical dimension” of S™. 

2.2 Slow rate bound on the prediction error 

We now turn to the first positive result on the spd-constrained least squares estimator E 
under an additional condition on the sampling operator X. Specifically, the prediction 
error will be bounded as 

i||A’(E*)-A’(S)||2 = 0(Ao||E*||i + A2), where Aq =-||T*(e)|U (13) 

n n 

with Ao typically being of the order 0{y^m/n) (up to logarithmic factors). The rate in 
CU) can be a significant improvement of what is achieved by E°^® if ||E*||i = tr(E*) is 
small. If Ao = o(||E*||i) that rate coincides with those of the nuclear norm regularized 
estimators ([T0|l . (fTT|l with regularization parameter A > Ao, cf. Theorem 1 in [22]. Eor 
nuclear norm regularized estimators, the rate 0(Ao||E*||i) is achieved for any choice 
of X and is hence slow in the sense that the squared prediction error only decays at 
the rate instead of n~^. Therefore, we refer to (1131) as “slow rate bound”. 
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Condition on X. In order to arrive at a suitable condition to be imposed on X so 
that m can be achieved, it makes sense to re-consider Example 3 to identify possible 
obstacles. Proposition [5] states that as long as n is bounded away from <5^/2 from 
above, there is a non-trivial A G such that X{A) = 0. Equivalently, 

dist(P;f,0)= min ||d:’(A )||2 = 0, where 

AeSj** (m) 

Vx := {z G R" : z = X{A), A G 5^(w)}, and (m) := {A G : tr(A) = 1}. 

In this situation, it is in general not possible to derive a non-trivial upper bound 
on the prediction error as dist(7^;t'j 0 ) = 0 may imply that Cx = R” in which case 
||A’(S*) — df(E)||| = ||£:|||. To rule this out, the condition dist(P;t',0) > 0 appears to 
be a natural requirement. More strongly, one may ask for the following: 

There exists a constant r > 0 such that Tq(X) := min —||A’(A )||2 > r^. (14) 

Aes+im) n 

This condition is sufficient to obtain a slow rate bound in the vector case, cf. Theorem 
1 in |25j . However, the condition required for the slow rate bound in Theorem [T] below 
is somewhat stronger than (Ill- 

Condition 1. 

There exist constants i?* > 1 and r* > 0 such that where for i? G R 

t‘^{X,R) =dist^{RVx,'Px)/n= min -\\X{A) - X{B)\\l. 

AgRS+( m) ^ 

BeS+(m) 

It follows from 

t‘^{X,R)= min -\\X{A) - X{B)\\l 

AGB5+(m)BeS+(m) «" ' ' 

< min -\\X{R-A) - X{A)\\l ( 15 ) 

AGS+(m) n 

= {R-lf min -\\XiA)\\l = {R-l)^T^{X) 

AeS+{m) n 

that Condition [T] is in fact stronger than m- Below, we provide a sufficient condition 
on X that implies Condition [TJ 

Proposition 3. Suppose that there exists a G R", ||a ||2 < 1, and constants 0 < < 

0 max such that 

Amin(n“^/^A’*(a)) > (/fmin, and Ainax(n“^/^T’*(a)) < (^max- 

Then for any f > 1, X satisfies Condition\^with i?* = and rf = {(, — l)^(/>niax- 

The condition of Proposition [3] can be phrased as having a positive definite matrix in 
the unit ball of the range of T”*, which, after scaling by 1 / y/n, has its smallest eigenval¬ 
ues bounded away from zero and condition number bounded from above. As a simple 
example, suppose that Xi = y/nl. Invoking Proposition [3] with a = (1,0,..., 0)^ and 
C = 2, we find that Condition [T] is satisfied with i?* = 2 and = 1. A more interesting 
example is random design where the {Xi}f^i are (sample) covariance matrices, where 
the underlying random vectors satisfy appropriate tail or moment conditions. 

Corollary 1. Let tt^ be a probability distribution on R'” with second moment matrix 
P := [zz^] satisfying A„iin(r) > 0. Consider the random matrix ensemble 

M{-Km,q)=\^^^Zkzl, {^fclLl • (16) 
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Suppose that M{Tr^,q) and let f„ := i Ya=i 0 < e„ < Amin(r). 

Under the event {||r —r„||oo < e„}, X satisfies Condition\l\with 

^(Aniax(i j + Cnj j 2 /\ i ^2 

^ /p\ (ITld — (Amax(r) H“ 67;,) . 


It is instructive to spell out Corollary [T] with as the standard Gaussian distribution 
on R™. The matrix r„ equals the sample covariance matrix computed from N = n - q 
samples. It is well-known (see e.g. [TT]) that for m,N large, Amax(r„) and Ai„in(r„) 
concentrate sharply around {1 + r]nfi and (I — rjn)^, respectively, where rjn = . 

Hence, for any 7 > 0, there exists C'.y > 1 so that if iV > it holds that i?* < 2 -I- 7 . 
Similar though weaker concentration results for ||r — r„||oo are available for the broad 
class of distributions having finite fourth moments |30| . When specialized to g = 1, 
Corollary[T]yields a statement about X made up from random rank-one measurements 
Xi = zz^^ i = 1 ,..., n, cf. ©. The preceding discussion indicates that Condition [T] 
tends to be satisfied in this case. 


Main result of this subsection. We are now in position to state the following 
theorem. 


Theorem 1. Suppose that model ([6]) holds with X satisfying Condition\l\ with con¬ 
stants i?* and rf. We then have 


1 


||T(E*) - T(E)||1 < max 2(1 + i?,)Ao||E*||i, 2Ao||E*||i + 8 Aq 






where, for any fi > 0, with probability at least 1 — ( 2 m) ^ 


Aq < cr\/(l + M) 21 og( 2 m) —, where = 


i=l 


Remarks. 

1. Under the scalings i?* =0(1) and rf = U(l), the bound of Theorem [T] is of the 
order 0(Ao||E*||i -|- Aq) as announced in (fT^ at the beginning of this section. 

2. For given X, the quantity t‘^{X,R) can be evaluated by solving a least squares 
problem with spd constraints. Hence it is feasible to check in practice whether 
Condition [T] holds. In fact, the bound of Theorem [I] can be replaced with 

mnmax|2(l + i?)Ao||E*||i, 2Ao||E*||i + 8 (^Aq-^^) 

3. For later reference, it is of interest to evaluate the term for Ai{TTm,q) with 
TTm as the standard Gaussian distribution. It is proved in Appendix [FI that with 
high probability, it holds that 

< (1 + q~^^^ + ^/m/{nq)^ (^ + ^Jm/q+ v^4(m/g) logn^ = 0 (m log n) 
as long as m = 0{nq). 


2.3 Bound on the estimation error 

In the previous subsection, we did not make any assumptions about E* apart from 
E* £ Henceforth, we suppose that E* is of low rank 1 < r <C m and study 













the performance of the constrained least squares estimator o for prediction and 
estimation in such setting. 

Preliminaries. Let E* = UAU^ be the eigenvalue decomposition of E*, where 


C/|| C/x 

A.J- 0rx(m—r) 

m X r m X (m — r) 

0(m—r)xr 0(m—r)x(m—r) 


where is diagonal with positive diagonal entries. Consider the linear subspace 

T-^ = {M G S™ : M = U±AUl, A G S™"’'}. 

From Uj'S*U± = 0, it follows that E* is contained in the orthogonal complement 

T = {M G S™ : M = Ui\B + B^U^, B G 

which has dimension mr — r[r — l)/2 <C <5^ if r m. The image of T under X is 
denoted by T = {z G K” : z = X{M), AI G T}. 


Conditions on X. We now introduce the key quantities the bound in this subsection 
depends on. 

Separability constant. 


t 2 (T) 


idist^ {r,Vx ), Vx ■■= {z : z = X{A), A G T-^ n5+(m)} 


min 

GgT, AG5+(m)nT 


111^(0)-A’(A)|| 


2 

2 


Restricted eigenvalue. 


4>\T) 


min 

Ot^A^T 


ll^(A)||i/n 

l|A||? 


As indicated by the following statement concerning the noiseless case, for bounding 
||E — E*||, it is inevitable to have lower bounds on the above two quantities. 


Proposition 4. Consider the trace regression model o with Si = 0, i = 1.,... ,n. 
Then 

argmin;:^||A’(E*) - A’(E)||^ = {E*} for all E* G T n S? 
sgs™ 2n 

if and only if it holds that t^(T) > 0 and 4>^{T) > 0. 


Correlation eonstant. Moreover, we make use of the following the quantity. It is not 
yet clear to us whether control of this quantity is intrinsically required, or whether its 
appearance in our bound is for merely technical reasons. 

/r(T) =max|-(A’(A),A’(A')) : || Alh < 1, A G T, A'G5+(m)nT*^ 
n 


We are now in position to provide a bound on ||E — E*||i. 

Theorem 2. Suppose that model © holds with E* as eonsidered throughout this 
subseetion and let Aq be defined as in Theorem{l\ We then have 


|E-E*||i < maxjSAo- 
8Ao 


A^(T) / 

'3 m(T) 

2(T)(/)2(T) 1 

,2 cp^T) 

1 

O 

00 

J ’ t^(T) I 


>2(T) r2(T) 
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Remark. Given the above bound on ||E — S*||i, it is possible to obtain an improved 
bound on the prediction error scaling with Aq in place of Aq, cf. m in Appendix |E1 

The quality of the bound of Theorem [2] depends on how the quantities t^(T), (tP'iT) 
and ^(T) scale with n, m and r, which is highly design-dependent. Accordingly, the 
estimation error in nuclear norm can be non-finite in the worst case and O(Aor) in the 
best case. 

• The quantity t^(T) is specific to the geometry of the constrained least squares 
problem © and hence of critical importance. For instance, it follows from 
Proposition [2] that for standard Gaussian measurements, 'r^(T) = 0 with high 
probability once n < ^Sm- The situation can be much better for random spd 

measurements (HU) as exemplified for measurements Xi = ZizJ with Zi 
N{0,1) in the subsequent section. Specifically, it turns out that t^(T) = 
as long as n = r2(m • r). 

• It is not restrictive to assume that the quantity </>^(T) is positive. Indeed, without 
that assumption, even an oracle estimator based on knowledge of the subspace 
T would fail. Reasonable sampling operators X have rank min{n, 5m} so that 
the nullspace of X only has a trivial intersection with the subspace T as long as 
n > dim(T) = mr — r(r — l)/2. 

• For fixed T, computing /i(T) entails solving a biconvex (albeit non-convex) op¬ 
timization problem in the variables A G T and A' G iSj*" (m) fl . Alternating 
optimization (also known as block coordinate descent) is a practical approach 
to such optimization problems for which a globally optimal solution is out of 
reach. In this manner we explore the scaling of t^(Tr) numerically as done for 
t^(T). We find that /r(T) = 0{5m/n) so that /i(T) = 0(1) apart from the regime 
njSm 0, without ruling out the possibility of undersampling, i.e. n < Sm- 


3 Numerical results 


In this section, we provide a series of empirical results regarding properties of the 
estimator S. In particular, its performance relative to regularization-based methods 
is explored. We also present an application to spiked covariance estimation for the 
GBGL face image data set and stock prices from NASDAQ. 


3.1 Scaling of the constant t^(T) 

For X and T given, it is possible to evaluate 'r^(T) by solving a convex optimization 
problem. This is different from other conditions employed in the literature such as 
restricted strong convexity [21] , 1-RIP [10] or restricted uniform boundedness [5] that 
involve a non-convex optimization problem even for fixed T. 

We here consider sampling operators with random i.i.d. measurements Xi = ZizJ , 
where Zi ^ N{0,I) is a standard Gaussian random vector in R"* (equivalently, Xi 
follows a Wishart distribution) , i = 1,..., n. We expect t^(T) to behave similarly for 
random rank-one measurements of the same form as long as the underlying probability 
distribution has finite fourth moments, and thus for (a broad subclass of) the ensemble 
M{7Tm,q) dM])- 

In order to explore the scaling of t^(T) with n, m and r, we fix to G {30, 50, 70,100}. 
For each choice of to, we vary n = aSm, where a grid of 20 values ranging from 0.16 to 

1.1 is considered a. For r, we consider the grid {1, 2,..., to/ 5}. For each combination 
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Figure 1: Scaling of logr^(T) in dependence of r (horizontal axis) and a = njbm 
(colors/symbols). The solid lines represent the fit of model (|T7l) . Note that the curves 
are only fitted to those points for which t^(T) exceeds 10“®. Best seen in color. 


of m, n, and r, we use 50 replications. Within each replication, the subspace T is 
generated randomly from the eigenspace associated with the non-zero eigenvalues of 
a random matrix G^G, where the entries of the m x r matrix G are i.i.d. A^(0,1). 
The results point to the existence of a phase transition as it is typical for problems 
related to that under study [2]. Specifically, it turns out that the scaling of t^(T) can 
be well described by the relation 

r^(T) Ri max{l/r-6>m,„,0}, (17) 

where 4>m,n,0m,n > 0 depend on m and n. In order to arrive at model (flT)) . we first 
obtain the 5%-quantile as summary statistic of the 50 replications associated with each 
triple (n, m, r). At this point, note that the use of the mean as a summary statistic is 
not appropriate as it may mask the fact that the majority of the observations are zero. 
For each pair of (n,m), we then identify all values of r for which the corresponding 
5%-quantile drops below 10“®, which serves as effective zero here. For the remaining 
values, we ht model 113 using nonlinear least squares (working on a log scale). Figure 
[T] shows that model (flTll provides a rather accurate description of the given data. Con¬ 
cerning (prn,n and 9m,m the scalings 4>m,n = and 0m,n = dom/ri for constants 

((> 0 , 6*0 > 0 appear to be reasonable. This gives rise to the requirement n > 9o{mr) for 
exact recovery to be possible in the noiseless case (cf. Proposition |4|) and yields that 
r^(T) = n(l/r) as long as n = 

3.2 Comparison with regularization-based approaches 

In this subsection, we empirically evaluate ||S — S*||i relative to regularization-based 
methods proposed in the literature. 


Setup. We consider Wishart measurement matrices as in the previous subsection. 
Again, we expect a similar behaviour for (most) other random designs from ensemble 
■M-iiTm, q)- We fix m = 50 and let n € {0.24,0.26,..., 0.36,0.4,..., 0.56} • and 
r € {1,2,..., 10} vary. For each configuration of n and r, we consider 50 replications. 
In each of these replications, we generate data 

y* = tr(WS*)-I- fTEi, cr = 0.1, i = l,...,n, (18) 

where S* is generated as the sum of r Wishart matrices and the {ediLi i-i.d. N (0,1). 
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Figure 2: Average estimation error (over 50 replications) in nuclear norm for fixed 
m = 50 and certain choices of n and r. In the legend, “LS” is used as a shortcut 
for “least squares”. Chen et al. refers to (fTOl) . indicates an oracular choice of the 
tuning parameter, “oracle” refers to the ideal error ary/m/n. Standard error bars are 
not shown as the standard errors are of negligible magnitude. Best seen in color. 


Regularization-based approaches. We compare S to the corresponding nuclear 
norm regularized estimator in dill) . Regarding the choice of the regularization parame¬ 
ter A, we consider the grid A*-{0.01,0.05,0.1,0.3,0.5,1, 2,4, 8,16}, where A* = ayjm/n 
as recommended in m and pick A so that the prediction error on a separate validation 
data set of size n generated from ()18|) is minimized. Note that in general, neither a 
is known nor an extra validation data set is available. Our goal here is to ensure that 
the regularization parameter is properly tuned. In addition, we consider an oracular 
choice of A where A is picked from the above grid such that the performance measure 
of interest (the distance to the target in the nuclear norm) is minimized. We also 
compare to the constrained nuclear norm minimization approach of Chen et al. [10] 


12 




































given by 

nnntr(E) subject to E ^ 0, and ||?/— d:’(E)||i < A. (19) 

For the parameter A, we consider the grid na^^j'K- {0.2, 0.3,..., 1,1.25}. This specific 
choice is motivated by the observation that E[||y — T’(E*)||i] = E[||£:||i] = noyJYpK. 
Apart from that, tuning of A is performed as for the nuclear norm regularized estima¬ 
tor. In addition, we have assessed the performance of the approach in [2, which does 
not impose an spd constraint but adds one more constraint to the formulation (1191) . 
That additional constraint significantly complicates optimization of the problem and 
yields a second tuning parameter. Therefore, instead of doing a grid search over a 2D- 
grid, we use fixed values as specified in [5] given the knowledge of a. The results are 
similar or worse than those of (fTOl) (note in particular that positive semidefiniteness 
is not taken advantage of in the approach of [2) and are hence not reported here. 

Discussion of the results. We can conclude from Figure [2] that in most cases, 
the performance of the constrained least squares estimator does not differ much from 
that of the regularization-based methods with careful parameter tuning, which are 
not too far from the oracle. However, for larger values of r, the constrained least 
squares estimator seems to require slightly more measurements to achieve competitive 
performance. 


3.3 Real data examples 

We conclude this section by presenting an application to recovery of spiked covariance 
matrices, a notion due to [Si- 

Background. A spiked covariance matrix is of the form E* = 

where r m and A^cr^ > 0, j = 1,..., r. Note that for data following the 

factor model 

r 

( 20 ) 

i=i 

for orthogonal factors {fj}j-i and random coefficients a^- ^ N{0^Xj) independent 
from the population covariance matrix i = 1 ,... ,n, is of the form given 

above. Model (EOl) is one possible way to motivate principal components analysis 
(PCA); this connection explains the relevance and the popularity of spiked covariance 
models. 

Extension to the spiked case. So far, we have assumed that the target E* is of low 
rank, but it is straightforward to extend the proposed approach to the case in which 
E* is spiked as long as cr^ is known or an estimate is available. A constrained least 
squares estimator of E* takes the form E -|- a^I, where 

E e argmin^lly-A’(E-HCT^/)||i. (21) 

Egg™ 2n 

Data sets. (1) The CBCL facial image data set [1] consist of A^ = 2429 images of 
19 X 19 pixels (i.e., m = 361). We take E* as the sample covariance matrix of this 
data set. It turns out that E* can be well approximated by E^, r = 50, where E^ is 
the best rank r approximation to E* obtained from computing its eigendecomposition 
and setting to zero all but the top r eigenvalues. (2) We construct a second data set 
from the daily end prices of to = 252 stocks from the technology sector in NASDAQ, 
starting from the beginning of the year 2000 to the end of the year 2014 (in total 
N = 3773 days, retrieved from finance.yahoo.com). We take E* as the resulting 
sampling correlation matrix and choose r = 100 . 

Experimental setup. As in all preceding measurements, we consider n random 
Wishart measurements for the operator A, where n = C(mr), where C ranges from 
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Figure 3: Average reconstruction errors logj^pHS — S*||_f in dependence of nj[mr) and 
the parameter /3. “oracle” refers to the best rank r-approximation Standard errors 
are one order of magnitude smaller, and are thus omitted. 


CBCL 


/3 

1 

1 

.4 

.4 

.08 

C 

2 

6 

4 

6 

10 


|S-S*||jr 

< 3 

< 2 

4 

3 

5 


Er-S*||F 


NASDAQ 


P 

1 

1 

1 

1 

C 

1 

2 

3 

6 


|e-s*IIf 

< 3.5 

< 2 

< 1.3 

< 1.1 


Sr-S*||F 


Table 1: Average reconstruction errors relative to for some selected values of /3 and 
nj {mr). 


0.25 to 12. Since ||Sr —E*||i?/||E*||i^’ « 10“^ for both data sets, we work with cr^ = 0 in 
(j21l) for simplicity. To make the problem of recovering E* more difficult, we introduce 
additional noise to the problem by using observations 

2/i = tr(AiS'i), i = l,...,n, (22) 

where Si is an approximation to E* obtained from the sample covariance respectively 
sample correlation matrix of PN data points randomly sampled with replacement from 
the entire data set, z = 1,..., n, where /3 ranges from 0.4 to 1/N {Si is computed from 
a single data point). For each choice of n and /3, 20 replications are considered. The 
reported results are averages over these replications. 

Results. For the CBCL data set, it can be seen from Figure [3] and Table [U that E 
accurately approximates E* (within a factor of three of the best rank-r approximation 
Er) once the number of measurements crosses 2mr. Performance degrades once addi¬ 
tional noise is introduced to the problem by using measurements (I22p that are taken 
from a perturbed version of E*. Even under signihcant perturbations (/3 = 0.08), 
reasonable reconstruction of E* remains possible, albeit the number of required mea¬ 
surements increases accordingly. In the extreme case /3 = 1/iV, the error is still 
decreasing with n, but millions of samples seems to be required to achieve reasonable 
reconstruction error (for computational reasons, we stop at n = 12mr Ri 216, 000). 
The general picture is similar for the NASDAQ data set, but the difference between 
using measurements based on the full sample correlation matrix on the one hand and 
approximations based on random subsampling (1221) on the other hand are more pro¬ 
nounced. For /3 = 1, the reduction in error with increasing n progresses visibly faster 
as for the first data set, and a smaller error relative to E^. close to 1 is achieved. 
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4 Conclusion 


In this paper, we have investigated trace regression in the situation that the underlying 
matrix is symmetric positive semidefinite. We have shown that under certain restric¬ 
tions on the design, the constrained least squares estimator enjoys excellent statistical 
properties similar to methods employing nuclear norm regularization. This may come 
as a surprise, as regularization is widely regarded as necessary in small sample set¬ 
tings. On the application side, we have pointed out the usefulness of our findings for 
recovering spiked covariance matrices from quadratic measurements. 
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A Proof of Proposition [1] 


By rotational invariance of the Gaussian distribution of e, it suffices to consider the 
canonical orthonormal basis of given by 


Xi — eieJ, X 2 — -^(eiej + e2ej),... + CmeJ), X^+i — e2ej, 


X. 


1 


1 


m+2 


— —^(6263 -|- 6362 ),.■■, X^^-i — + eme^_i), Xs^ — e 


V 2 ^ 


V 2 ^ 


where {ej}JLi denote the canonical basis vectors of K'". Equivalently, the correspond¬ 
ing map X : S™ —>■ equals the symmetric vectorization operator 


s — (cTj-fe) (cTll, '/2ai2, . . . , V^CTim, 0-22, V^Cr 23 , ■ ■ ■ , •\/ 2 cr(m-l)m: 


(23) 


Accordingly, denote by {ejk}i<j<k<m the error terms corresponding to the entries 
Wjk}i<j<k<m- The minimization problem ([7]) can hence be expressed as 


. 1 
mm — 
2 n 


^33) T V2ajk) 

3=1 j<k 


. 1 

= mm — 
sesip 2 n 






(24) 


where the matrix E = X*{e) has entries Ejj = ejj, j = 1,... ,m, and Ejk = Ejkl'/^., 
j, fc = 1,..., m, j ^ k. Now observe that the minimizer E of (1241) coincides with the 
Euclidean projection of E on S™. It is well-known [3] that the projection of a sym¬ 
metric matrix on the positive semidefinite cone is obtained by setting all its negative 
eigenvalues to zero, i.e., in terms of the eigendecomposition of A = ^3 '^J^ 

we have 

m 

E = max{Aj(E), 0}ujuj. 
f=i 

At this point, we note that FI is a Wigner matrix, whose empirical distribution of its 
eigenvalues follows Wigner’s semicircle law as m —> 00 (cf. [27]), which is symmetric 
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around zero. Consequently, we have 


m ^ 

||A’(E)||2 = ||E||| = ^{A,(£;),0}2 ^ -WE^f 
i=i 



in probability as to —>■ c». 


B Proof of Proposition [2] 


The proof of Proposition [5] follows from results in [5] . 

Definition B.l. Let C C 6e a convex cone. The statistical dimension of C is 
defined as 5{C) = E[||nc 5 ll 2 ]) where He denotes the Euclidean projection onto C and 
the entries of g are i.i.d. N(0,1). 

Theorem B.l. Let f : > ]RU{—oo, + 00 } be a proper convex function. Suppose 

that A G has i.i.d. N(Q,1) entries, and let zq = Axq for a fixed Xg G R'^. 

Consider the convex optimization problem 

minimize f{x) subject to Ax = zg. (25) 

and let 'D{f,xg) = Ut>oi^ ^ ■ fi^o + tv) < f{xg)} denote the descent cone of 

f at Xg. Then, for any e > Q, if n < (1 — £)5{'D{f,xg)), with probability at least 
1 — 32exp(— xg fails to be the unique solution of (l25l) . 


Proof. (Proposition [2]). Denote by svec : S'" —>■ R^™ the symmetric vectorization map 
(cf. (051) 1. which is an isometry with respect to the Euclidean inner product on S™ 
and R'^™, and by svec“^ : R^™ —>■ S'" its inverse. We can then apply Theorem IB. II to 
the setting of Proposition [5] by using 


d = Sm, X = svec(S), Xg = 0, f{x) 


is™(svec ^(x)), 


A = 


svec(Xi) 

svec(X„)^ 


where is the convex indicator function of S™ which takes the value 0 if its argument 
is contained in S!p and +00 otherwise. Observe that 'D{f,0) = S™. It is shown in 
[2], Proposition 3.2, that the statistical dimension S(§!f) = i5m/2. This concludes the 
proof. □ 


C Proof of Proposition [3] 


Proposition [3] follows from the dual problem of the convex optimization problem as¬ 
sociated with t^{X,R). Below, it will be shown that the Lagrangian dual of the 
optimization problem 


min^||T(Al)-T(i?)||2 

A,B 

subject to A ^ 0, B PO, tr(A) = R, tr(i3) = 1. 

is given by 

max 9 ■ R — S 

9,S,a 

subject to h 01, A SI, ||a ||2 < 1. 

vn 


(26) 


(27) 
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The assertion of Proposition [ 3 ] follows immediately from ( 1371 ) by identifying 9 = 
^Tnin{n~^^^X*{a)) and S = In the remainder of the proof, du¬ 

ality of ( 1331 ) and ( 1371 ) is established. Using the shortcut X = X jy/n, the Lagrangian 
of the dual problem ( 1371 ) is given by 

L{e, 5,a-A,B,K) = e-R-5 + {x*{a) - 91, - (^X*{a) - SI, - «;(||a||^ - 1). 

Taking derivatives w.r.t. 9, 5, r and the setting the result equal to zero, we obtain from 
the KKT conditions that a primal-dual optimal pair {9,d,a, A, B,k) obeys 

tr(A) = i?, tr(i?) = l, X{A) - X{B)-K2a = 0. (28) 

Taking the inner product of the rightmost equation with a, we obtain 

(a,XiA)-XiB))-1i2\\m = 0. 

^ (^X*(a),A-S'j-K2\\a\\l = 0. 

0 tr(A) — 5 tr(i?) —/?2||a||2 = 0. 

9R-S = K2\\a\\l, 

where the second equivalence is by complementary slackness. Consider first the case 
9R — S > 0. This entails It > 0 and thus ||a||| = 1, so that 2k = 9R — S. Substituting 
this result into the rightmost equation in (1281) and taking norms, we obtain 

9R-6 = \\X{A) - X{B)h = ^II^(A) - XiB)h. (29) 

y/n 

For the second case, note that 9R — S cannot be negative as a = 0 is feasible for dlZl). 
Thus, 9R — (5 = 0 implies that a = 0 and in turn also (|29|) . 

D Proof of Corollary [1] 


The corollary follows from Proposition [3] by choosing a = Ijy/n so that n ^l'^X*(a') = 
using that ||r - r„||oo < e™ implies that |Aj(r) - Aj(r„)| < e„, 
j = 1,... ,m ([H], §4.3). The specific values of i?* and are obtained by choosing 
^ = 2 in Proposition [31 


E Proof of Theorem [T] 


The following lemma is a crucial ingredient in the proof. In the sequel, let A = E — E*. 
Let the eigendecomposition of A be given by 

mm m 

A = \j{A)ujuJ = max{0, Xj{A)}ujuJ -f- min{0, \j{A)}ujuJ = A+ -|- A“ 
i=i i=i 

= :A+ =:A- 

(30) 

Lemma E.l. Consider the decomposition (13(11) . lUe have ||A”||i < ||E*||i. 

Proof. Write A+ = and A“ = U-A-Uj for the eigendecompositions of A+ 
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and A , respectively. Since E A 0, we must have tr(E[/_17T) > 0 and thus 

0 < tr(E17_C/T) = tr(C/TEC/_) 

= tr(C/T(S* + A)C/_) 

= tr(C/l(S* + U+K+Uj + U-K-Ul)U-) 

= tr(S*{7_{7T) + tr(A_), 

where for the last identity, we have used that UjU- = 0. It follows that 

||A-||i = ||A_||, = -tr(A_) < tr(S*C/_C/T) < ||S*||i||C/_C/T||oo = ||S1|i. 

□ 


Equipped with Lemma IE. 11 we turn to the proof of Theorem [TJ 


Proof. (Theorem[T|) By definition of S, we have \\y — A’(E )||2 < \\y — A’(E*)|| 2 . Using 
((6|l and the definition of A, we obtain after re-arranging terms that 


i||X(A)||^<i(.,X(A)) = i(A-(.),A) 

I||.r(A)||i<2||j;-(£)/n|U||A||, = 2A„(||A+||, + ||A-||.), (31) 


where we have used Holder’s inequality, the decomposition of A as in Lemma lE.il and 
Ao = IIA’*(e)/n||oo- We now upper bound the l.h.s. of (l3T]) by invoking Condition [T] 
and Lemma IETI which yields ||A“||i < ||E*||i. If ||A+||i < i?*||A“||i, we have 

i||A’(E) - A’(S*)||2 = 111^(^)112 < 2{R, + l)Ao||S*||i, 
n n 


which is the first part in the maximum of the bound to be established. In the opposite 
case, suppose first that ||A“||i > 0 (the case ||A“||i = 0 is discussed at the end of this 
proof) and we have ||A+||i/||A“||i = i? > i?* > 1. Consequently, 


-||A’(A)||2 = -||A’(A+)-A’(-A- 


= IIA 


- I|2 
1 


A" 


A+ 

1^ 

1 , 


- T 


-A- 


>||A-||f min -\\X{A)-X{B)r 2 
A^RS+{m) n 


= r2(A’,i?)||A-||2 = r2(A’,i?) 


l|A+||f 


2 


2 


Inserting this into dST]), we obtain the following upper bound on ||A+||i. 


A|A||a+||;< 2 A„:^||A+||, 


||A+||i < 2Ao®^ < 4Ao-^ 

t^X,R) t^X,R) 


Rl 

< 4Ao^, 


where the last inequality follows from the observation that for any i? > i?* 

T^iX,R) > (i?/i?,)V(A’,i?,)i 
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which can be easily seen from the dual problem (071) associated with R). Sub¬ 

stituting the above bound on ||A+||i into (I5T|) and using the bound ||A“||i < ||S*||i 
yields the second part in the maximum of the desired bound. To finish the proof, 
we still need to address the case ||A~||i = 0. Recalling the definition of the quantity 
Tq (T) in (ITTl) . we bound 

i||A(A)||2 = i||A(A+)||2>ro^(T)||A+||?. 

Inserting this into (l3T]) . we obtain from (IT^ 

Back-substitution into (OTl) yields a bound that is implied by that of Theorem [TJ This 
concludes the proof. □ 


Bound on Xq. The bound on Aq is an application of Theorem 4.6.1 in |29j . 

Theorem E.l. JSyf Consider a sequence of fixed matrices in and let 

{£i}?=i * Af(0, cr^). Then for all t > 0 




i=l 


> t < 2mexp(-tV(2crV^)), := 




Choosing t = aV+ ^)2\og{2m) yields the desired bound. 


F Proof of Theorem [T], Remark 3 


The bound hinges on the following concentration result for the extreme eigenvalues of 
the sample covariance of a Gaussian sample. 

Theorem F.l. JlfJ / Let zi,...,zn be an i.i.d. sample from and letVpf = 

^ X]i=i ■ ^6 then have for any d > 0 

P >(l + S+ \ < eM-NSy2). 


In the proof, we also make use of the following fact. 
Lemma F.l. Let {Xi}^^^ C S™. Then 


E^' 

2=1 


< max ||Ai|l^ 

l<2<n 


E^* 

2=1 


Proof. First note that for any v € and any M G ST, we have that 


v^M^v = ^ Af(M)(nJn)2 < A„,ax(M) ^ Aj(M)(nJn)2 = \\M\\^v^Xv, 
i=i i=i 

where {ujjJLi are the eigenvectors of X. Accordingly, we have 




max Xfv < max llAdloo max XiV 

Ikll2 = l ^ l<*<n" ||„|U = 1 ^ 


= max llAillc 

l<2<n 


E^* 

i=l 


□ 
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We now establish the bound to be shown. Each measurement matrix can be expanded 
as 

1 9 

^i = - Y1 ■ ^(0, Im), * = 1, . . . , U. 

® fe=l 


Accordingly, we have 

n 




n ( q 


^ik^ik 


i—1 \ fc=l 


< max • 

l<2Xn 


: 


k^l 


^ max Amax ( ^ ^ ^ik^ik 1 I” Aniax(rnq) 

J) 


nq 




i—l k—1 


where Tnq follows the distribution of Eat in Theorem IF.ll with N = nq. For the first 
term, applying Theorem IF.ll with N = q and <5 = log(n)/q and using the union 

bound, we obtain that 


P ^Aniax 'y ^ > 


/ y/q + y/m + y/ 4m log n 




79 


< exp(—(2m — 1) logn). 


Applying Theorem IF. II to T n with S = Ijy/q, we obtain that 

2 \ 


P I A„iax(rn5) > ( 1 ■!-;= + 


y/q \ nq 

Combining the two previous bounds yields the assertion. 


< exp(—n/2). 


G Proof of Proposition |4] 


In the sequel, we write IIt and Htj. for the orthogonal projections on T and T-*-, 
respectively. Note first that since the are zero, any minimizer E satisfies 

A(E) = A(E*) A(A) = 0 A’(At) + X{Aj±) = 0 (32) 

where At = IIt A and Atj- = IlTr A, where we recall that A = E — E*. Note that 

since E* = IItE*, for E to be feasible, it is necessary that At^ A 0. 

Suppose first that 'r^(T) = 0. Then there exist 0 € T and A G (m) fl such that 
A(0) + A(A) = 0. Hence, for any E* G T with E* + 0 ^ 0, the choices At = 0 and 
Atj- = A ensure that E is feasible and that (1321) is satisfied. Since A is contained in 
the Schatten 1-norm sphere of radius 1, it is necessarily non-zero and thus E 7^ E*. 

If = 0, there exists 0 7^ 0 G T such that A(0) = 0. Consequently, for any 

E* G T n S™ with E = E* -I- 0 ^ 0, (15^ is satisfied with E 7^ E*. 

Conversely, if t^(T) > 0, (15^ cannot be satisfied for Atj- A 0, Atj- 7^ 0. Otherwise, 
we could divide by tr(ATJ-), which would yield 

A’(AT/tr(ATJ-)) -|- A’(Atj-/ tr(ATJ-)) = 0, 
eT eS+(m)nT-i- 

which would imply t^(T) = 0. Therefore, we must have Atj- = 0 and A’(At) = 0, 
which implies At = 0 as long as ^^(T) > 0. 
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H Proof of Theorem [2] 


Let A = S — E*, At = IItA and At-l = IIt-lA A 0 as in the preceding proof. We 
start with the following analog to (IHTl) 

-||A’(A)||2 = 1||A’(At + At.)||2 < 2Ao(||At|1i + llVlIi) (33) 

n n 


Suppose that ^ 0. We then have 


IIAt^II? 


A” 


Ai 


IA 




A. 


Ti 


T-i-lll , 




< 2Ao(||At||i + ||Atj-||i) 


Since At/||Atj-||i € T and Atj-/||Atj-|| i = ATi/tr(ATJ-) £ we obtain that 

the term inside the curly brackets is lower bounded by t^(T) and thus 


|Atj-||i < 


2Ao / ^ ||At||i 


HT) 


^T-L 111 , 


(34) 


On the other hand, expanding the quadratic term in (j33ll . we obtain that 


i||A’(AT)||2 - - (-r(AT), A’(At.)) < i||A’(A)||2 < 2 Ao(||At||i + ||At.||i) 
n n \ In 

-||A’(At)||2 < 2Ao(||At||i + ||Atj-||i) + 2^(T)||At||i||Atj-||i 
n 

(^^(T)||AtIIi < 2Ao(||At||i + ||Atj-|1i) + 2/r(T)||AT||i||ATJ-||i 
^ 2Ao ^1 + ||Atj-||i/||At||i^ + 2/x(T)||Atj-||i 


=> I|At||i < 


HT) 


(35) 


We now distinguish several cases. 

Case 1 : ||At||i < ||Atj-||i. It then immediately follows from (IM)) that 


l|A||i< 


8An 


=-T.. 


r2(T) 

Case 2a: ||At||i > ||Atj-||i and ||ATi|li < 4Ao/((>^(T). From (1551) . we first get 

4Ao + 2/r(T)||ATJ-||i 


|At||i < 


^(T) 


and thus 




Case 2b: ||At||i > ||Atj-||i and ||Atj-||i > 4Ao/((>^(T). Plugging ([32]) into 
obtain that 

HA II < ^^0 I 4Ao/r(T) 

II '^-"lli - r2(T) t2(T)(()2(T)' 

Substituting this bound back into (1571) yields 

4Ao 8Ao/i(T) 8Ao/r^(Tr) 


(36) 

(37) 

(38) 
we 


I|At||i < 


02(T) t2(T)(()2(T) ((>4(T)t2(T)’ 

Collecting terms, we obtain altogether 


|A||i < 8Ao 


m(T) (Z , ^(T) 


+ 4Ao 


r2(T)(()2(T) V2 (/>2(T) 
Combining (1561) . (1581) and (1591) yields the assertion. 


1 


02 (T) t2(T) 


=:Ti. 


(39) 
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